Last iterate convergence of SGD for Least-Squares in the Interpolation regime.

Neural Information Processing Systems

Motivated by the recent successes of neural networks that can fit the data perfectly \emph{and} generalize well, we study the noiseless model in the fundamental least-squares setup. We assume that an optimal predictor perfectly fits the inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map. To solve this problem, we consider the estimator given by the last iterate of stochastic gradient descent (SGD) with constant step-size. In this context, our contribution is twofold: (i) \emph{from a (stochastic) optimization perspective}, we exhibit an archetypal problem where we can explicitly show convergence of the final SGD iterate for a non-strongly convex problem with constant step-size, whereas usual results require some form of averaging, and (ii) \emph{from a statistical perspective}, we give explicit non-asymptotic convergence rates in the over-parameterized setting and leverage a \emph{fine-grained} parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$. The link with reproducing kernel Hilbert spaces is established.
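As a rough, self-contained illustration of this setup (a linear feature map and synthetic data stand in for the paper's general $\phi$; the dimensions and step-size below are arbitrary choices, not values from the paper), constant step-size SGD on a noiseless least-squares problem drives the last iterate itself toward $\theta_*$, with no averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noiseless least-squares instance: Y = <theta_*, phi(X)> holds
# exactly; here phi is the identity map and d, T, gamma are arbitrary choices.
d, T = 50, 5000
theta_star = rng.normal(size=d)
gamma = 0.2  # constant step-size

theta = np.zeros(d)
for t in range(T):
    x = rng.normal(size=d) / np.sqrt(d)   # one streaming feature vector
    y = x @ theta_star                    # noiseless label (interpolation regime)
    theta -= gamma * (x @ theta - y) * x  # SGD step on 1/2 * (<theta, x> - y)^2

# No averaging: the error of the *last* iterate itself decays.
err = np.linalg.norm(theta - theta_star)
```

Because every sample is consistent with the same $\theta_*$, the stochastic gradients vanish at the optimum, which is what allows a constant step-size to work here without averaging.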


A General Framework for Analyzing Stochastic Dynamics in Learning Algorithms

Chou, Chi-Ning, Sandhu, Juspreet Singh, Wang, Mien Brabeeba, Yu, Tiancheng

arXiv.org Artificial Intelligence

One of the challenges in analyzing learning algorithms is the circular entanglement between the objective value and the stochastic noise, also known as the "chicken and egg" phenomenon; traditionally, there has been no principled way to tackle this issue. Analyses typically exploit the special structure of the particular dynamics and are therefore difficult to generalize. In this work, we present a streamlined three-step recipe for tackling the "chicken and egg" problem and give a general framework for analyzing stochastic dynamics in learning algorithms. Our framework composes standard techniques from probability theory, such as stopping times and martingale concentration. We demonstrate the power and flexibility of our framework by giving a unifying analysis, with last-iterate and strong uniform high-probability convergence guarantees, of three very different learning problems: stochastic gradient descent for strongly convex functions, streaming principal component analysis, and a linear bandit with stochastic gradient descent updates. We either improve or match the state-of-the-art bounds on all three dynamics.


Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes

Lei, Qi, Nagarajan, Sai Ganesh, Panageas, Ioannis, Wang, Xiao

arXiv.org Machine Learning

In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, \cite{DISZ17, LiangS18} show last iterate convergence of the so-called "Optimistic Gradient Descent/Ascent" for the case of \textit{unconstrained} min-max optimization. Moreover, in \cite{Metal} the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework; it uses extra information rather than \textit{only} the history to compute the next iteration. In this work, we show that the "Optimistic Multiplicative-Weights Update (OMWU)", which follows the no-regret online learning framework, exhibits local last iterate convergence for convex-concave games, generalizing the results of \cite{DP19}, where last iterate convergence of OMWU was shown only for the \textit{bilinear case}. We complement our results with experiments that indicate fast convergence of the method.
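As a minimal numerical sketch of the phenomenon (using the unconstrained bilinear game $f(x, y) = xy$ rather than the constrained settings above; the step-size and iteration count are arbitrary choices), the optimistic variant of Gradient Descent/Ascent converges in the last iterate, while plain GDA spirals away from the saddle point:

```python
import numpy as np

eta = 0.3  # step-size (illustrative)

# Bilinear game f(x, y) = x*y with unique saddle point (0, 0).
# Plain GDA: the last iterate spirals outward.
x, y = 1.0, 1.0
for _ in range(500):
    x, y = x - eta * y, y + eta * x
gda_dist = np.hypot(x, y)

# Optimistic GDA: step along 2*(current gradient) - (previous gradient).
x, y = 1.0, 1.0
gx_prev, gy_prev = y, x                  # grad_x f = y, grad_y f = x
for _ in range(500):
    gx, gy = y, x
    x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
ogda_dist = np.hypot(x, y)               # last iterate approaches (0, 0)
```

The only difference between the two loops is the optimistic extrapolation term, which damps the rotational drift that makes plain GDA diverge on bilinear problems.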


Combining No-regret and Q-learning

Kash, Ian A., Sullins, Michael, Hofmann, Katja

arXiv.org Artificial Intelligence

Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn, where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.

1 Introduction

Versions of counterfactual regret minimization (CFR) [50] have found success in playing poker at human expert level [10, 41] as well as fully solving nontrivial versions of it [8]. More generally, CFR can solve extensive-form games of incomplete information. It works by using a no-regret algorithm to select actions. In particular, one copy of such an algorithm is used at each information set, which corresponds to the full history of play observed by a single agent. The resulting algorithm satisfies a global no-regret guarantee, so at least in two-player zero-sum games it is guaranteed to converge to an optimal strategy through sufficient self-play. However, CFR does have limitations. It makes two strong assumptions which are natural for games such as poker but limit applicability to further settings. First, it assumes that the agent has perfect recall, which in a more general context means that the state representation captures the full history of states visited (and so imposes a tree structure). Current RL domains may rarely repeat states due to their large state spaces, but they certainly do not encode the full history of states and actions. Second, it assumes that a terminal state is eventually reached and performs updates only after this occurs.
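The core idea, one no-regret learner per state with Q-learning-style value feedback, can be sketched roughly as follows (a hedged toy sketch, not the paper's exact LONR algorithm: the 2-state MDP, the regret-matching learner, and the synchronous update are all made-up illustrative choices):

```python
import numpy as np

# Toy deterministic 2-state, 2-action MDP (made-up example).
# R[s, a] = immediate reward; P[s, a] = next state.
n_states, n_actions, discount = 2, 2, 0.9
R = np.array([[1.0, 0.0], [0.0, 0.5]])
P = np.array([[0, 1], [1, 0]])

Q = np.zeros((n_states, n_actions))
regret = np.zeros((n_states, n_actions))

def policy(s):
    # Regret matching: play proportionally to positive regret, else uniform.
    pos = np.maximum(regret[s], 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(n_actions, 1.0 / n_actions)

for _ in range(5000):
    for s in range(n_states):
        pi = policy(s)
        # Q-learning-like target: reward plus discounted value of the next
        # state under that state's *current* local policy (no terminal states).
        targets = R[s] + discount * np.array(
            [policy(P[s, a]) @ Q[P[s, a]] for a in range(n_actions)])
        regret[s] += targets - pi @ targets   # feed targets to the local learner
        Q[s] = targets
```

In this toy instance the self-loop in state 0 pays 1 per step, so the local learners should settle on action 0 in state 0 and action 1 in state 1.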


Last-Iterate Convergence: Zero-Sum Games and Constrained Min-Max Optimization

Daskalakis, Constantinos, Panageas, Ioannis

arXiv.org Machine Learning

Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al. and Liang and Stokes has established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits last-iterate convergence to saddle points in {\em unconstrained} convex-concave min-max optimization problems. We show that the same holds true in the more general problem of {\em constrained} min-max optimization under a variant of the Multiplicative-Weights-Update method called "Optimistic Multiplicative-Weights Update (OMWU)". The generality of the constrained problem, which in particular captures all Linear Programming, requires fundamentally different techniques for analyzing the progress of OMWU towards min-max solutions. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU becomes a contracting map converging to the exact solution. We experiment with zero-sum games to measure how the convergence rate scales with the dimension.
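The OMWU update itself can be sketched in a few lines (a hedged illustration on rock-paper-scissors, a made-up example whose unique equilibrium is the uniform strategy; the step-size and iteration count are arbitrary choices):

```python
import numpy as np

# Zero-sum game min_x max_y x^T A y; rock-paper-scissors payoff matrix.
A = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])
eta = 0.1  # step-size (illustrative)

x = np.array([0.6, 0.3, 0.1])            # arbitrary interior starting strategies
y = np.array([0.2, 0.5, 0.3])
gx_prev, gy_prev = A @ y, A.T @ x

for _ in range(5000):
    gx, gy = A @ y, A.T @ x              # current payoff gradients
    # Optimistic multiplicative weights: reweight by 2*(current) - (previous)
    # gradient, then renormalize onto the simplex.
    x = x * np.exp(-eta * (2 * gx - gx_prev))
    x /= x.sum()
    y = y * np.exp(eta * (2 * gy - gy_prev))
    y /= y.sum()
    gx_prev, gy_prev = gx, gy

# The last iterate (x, y) itself approaches the uniform equilibrium (1/3, 1/3, 1/3);
# plain MWU would instead cycle around it.
```

The multiplicative form keeps the iterates on the probability simplex, which is what makes OMWU suit the constrained setting the abstract describes.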